IRIT at TREC KBA 2014
Abstract
This paper describes the IRIT lab participation in the Vital Filtering task (also known as Cumulative Citation Recommendation) of the TREC 2014 Knowledge Base Acceleration Track. This task aims at identifying vital documents containing timely new information that should help a human to update the profile of the target entity (e.g., the Wikipedia page of the entity). In this work, we evaluate two factors that could detect vitality. The first one uses a Language Model to learn vitality from a sample of vital documents, and the second leverages bursts of documents in the stream. Obtained results are presented and discussed.

1 Presentation of the task

The aim of the Vital Filtering task(1) is to identify vital documents for a given entity. These documents should help knowledge base editors to update the profile of the entity (e.g. its Wikipedia article). A specially filtered subset(2) of the full TREC 2014 StreamCorpus was provided for use in the 2014 TREC KBA Track. It consists of about 20 million timestamped documents from several sources (News, Social, Forum, Blog, etc.) with a compressed size of 639 GB. The target topic set is composed of 71 entities, including persons, organizations and facilities. Each entity E has a training end time, denoted TTRend(E). Documents published before TTRend(E) can be used as training data, and those published after are used as test documents. Document annotations were done as follows: a document is considered Vital if it contains timely relevant information about the entity, Useful if it contains relevant but not timely information about the entity, Neutral if it mentions the entity without providing any information about it, and Garbage if it does not mention the entity. The official metric for the task is hF1, i.e. the maximum macro-averaged F1 measured over all confidence cutoffs i (i ∈ [0, 1000], where 1000 corresponds to the highest level of confidence and 0 corresponds to the level at which all documents are kept).

(1) http://trec-kba.org/trec-kba-2014/vital-filtering.shtml
(2) http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html#kba-streamcorpus-2014v0_3_0-kba-filtered

2 A two-step vitality filtering approach

Entity-centric document filtering methods have been classified into two categories: classification and ranking [1]. Unlike our past participation, which was based on a classification approach [2], we propose this year a ranking-based vitality filtering approach that involves two main steps: filtering and scoring.

2.1 Filtering step

The filtering step can be seen as a way to eliminate non-relevant documents. In this step, for each hour, we select only the top H documents that match the full entity name, based on the following score:

Score(d,E) = \frac{\sum_{t \in E} tf(t,d)}{|d|}

where d is a document and E represents the full entity name extracted from the given topic URL. For example, for the topic https://kb.diffeo.com/Jeff Mangum, E = Jeff Mangum. tf(t,d) is the term frequency of term t in document d. In addition, to reject documents that match the entity but are more likely to be spam, we apply the following two filters (a sketch of the whole filtering step is given below):

– An enumeration filter that rejects documents mentioning the entity only in an abusive list of more than n entities such as E1, ..., Et, ..., En.
– A links filter that rejects documents having more than n hyperlinks.
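A minimal Python sketch of this filtering step is given below. It is an illustration rather than the exact implementation used in the runs: the tokenizer, the heuristic used by the enumeration filter to detect lists of entities, and the thresholds (top_h, max_listed_entities, max_links) are assumptions.

```python
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def match_score(doc_text, entity_name):
    """Score(d, E): sum of the frequencies of the entity-name terms in d, divided by |d|."""
    tokens = tokenize(doc_text)
    if not tokens:
        return 0.0
    tf = Counter(tokens)
    return sum(tf[t] for t in tokenize(entity_name)) / len(tokens)

def passes_enumeration_filter(doc_text, entity_name, max_listed_entities=50):
    """Reject documents that only mention the entity inside an abusive list of entities.
    Here 'list of entities' is approximated by counting capitalized phrases on the lines
    where the entity is mentioned (an assumption; the paper does not detail this)."""
    for line in doc_text.splitlines():
        if entity_name in line:
            listed = re.findall(r"[A-Z][\w.-]*(?: [A-Z][\w.-]*)*", line)
            if len(listed) <= max_listed_entities:
                return True  # at least one "normal" (non-enumeration) mention exists
    return False

def passes_links_filter(doc_html, max_links=100):
    """Reject documents containing too many hyperlinks (likely spam or link farms)."""
    return doc_html.lower().count("<a ") <= max_links

def filter_hourly_batch(docs, entity_name, top_h=20):
    """Keep the top-H matching, non-spam documents for one hour of the stream.
    Each doc is assumed to be a dict with 'text' and 'html' fields."""
    scored = []
    for doc in docs:
        s = match_score(doc["text"], entity_name)
        if s == 0.0:
            continue  # no entity-name term occurs in the document
        if not passes_enumeration_filter(doc["text"], entity_name):
            continue
        if not passes_links_filter(doc["html"]):
            continue
        scored.append((s, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_h]]
```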
2.2 Scoring step

Documents that pass the filtering step are ranked using two vitality factors: a Language-Model-based factor and a Burst-based factor.

2.2.1 Estimating vitality with the Language-Model-based factor

We use a Language Model to estimate a vitality model for each entity. With this model, we want to detect "vital" words that help identify upcoming vital documents. As vitality is unknown a priori, we leverage the set of training vital documents (published before TTRend(E)). We believe that we can find some common features between the vital documents of the training set and a new vital one. Formally, given an entity E and a sample of n vital documents vd_i, we estimate the entity vitality model \theta_{V_E} as follows:

P(t \mid \theta_{V_E}) = \frac{\sum_{i=1}^{n} tf(t, vd_i) \cdot df(t)}{\sum_{i=1}^{n} |vd_i|}    (1)

We note that P(t | \theta_{V_E}) is not a strict probability distribution. tf(t, vd_i) is the term frequency of term t in document vd_i, and df(t) = \frac{1}{\log(m)} is used to boost terms often appearing in vital documents, where m represents the number of vital documents for E containing term t. The vitality score of a new incoming document d with respect to an entity E is evaluated as follows:

Score_{LM}(d, E) = \prod_{t \in topk(\theta_{V_E})} P(t \mid \theta_d)^{P(t \mid \theta_{V_E})}    (2)

where topk(\theta_{V_E}) is the set of the top k terms in \theta_{V_E}, and P(t | \theta_d) is estimated using Dirichlet smoothing as follows:

P(t \mid \theta_d) = \frac{tf(t,d) + \mu \frac{tf(t,C)}{\sum_{t' \in C} tf(t',C)}}{|d| + \mu}    (3)

tf(t,d) is the term frequency of term t in document d, tf(t,C) is the term frequency of term t in the collection C, C is the reference collection composed of early stream documents published before TTRend(E), and \mu is a smoothing parameter used to avoid null probabilities.

2.2.2 Estimating vitality with the burst-based factor

We assume that when new information is published about a given entity, this may lead to an accelerated growth in the number of documents describing this new information. Our idea is to consider a document as vital with respect to an entity E if it appears within a burst of documents that match the entity E. We hypothesize that the more matching documents there are in a short period, the higher the probability of encountering vital documents. We leverage this idea in our burst-based factor. Formally, for a new incoming document d, we evaluate its vitality with respect to E as follows:

Score_{Burst}(d, E) = (1 - e^{-\frac{x^2}{\sigma}}) \cdot \prod
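The sketch below illustrates the Language-Model-based factor of Section 2.2.1 (equations (1)-(3)). It is not the actual system: the tokenizer, the guard applied to df(t) when log(m) is zero, the reading of equation (2) as a product of smoothed document probabilities weighted in the exponent by the vitality-model weights, and the parameter defaults (k, mu) are assumptions made for illustration.

```python
from collections import Counter
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def vitality_model(vital_docs, k=100):
    """Estimate theta_VE from the training vital documents (equation (1)) and
    return its top-k terms with their weights."""
    doc_tfs = [Counter(tokenize(d)) for d in vital_docs]
    total_len = sum(sum(tf.values()) for tf in doc_tfs)
    vocab = set().union(*doc_tfs) if doc_tfs else set()
    weights = {}
    for t in vocab:
        m = sum(1 for tf in doc_tfs if t in tf)           # vital documents containing t
        df = 1.0 / math.log(m) if m > 1 else 1.0          # guard for log(1) = 0 (assumption)
        weights[t] = sum(tf[t] for tf in doc_tfs) * df / total_len
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    return {t: weights[t] for t in top}

def dirichlet_prob(t, doc_tf, doc_len, coll_tf, coll_len, mu=2000):
    """P(t | theta_d) with Dirichlet smoothing (equation (3))."""
    p_coll = coll_tf.get(t, 0) / coll_len
    return (doc_tf.get(t, 0) + mu * p_coll) / (doc_len + mu)

def score_lm(doc_text, theta_ve, coll_tf, coll_len, mu=2000):
    """Equation (2), computed in log space for numerical stability."""
    doc_tf = Counter(tokenize(doc_text))
    doc_len = sum(doc_tf.values())
    log_score = 0.0
    for t, w in theta_ve.items():
        p = dirichlet_prob(t, doc_tf, doc_len, coll_tf, coll_len, mu)
        if p > 0:
            log_score += w * math.log(p)
    return math.exp(log_score)
```

In this sketch, coll_tf would be a Counter over the tokens of the early stream documents published before TTRend(E), and coll_len the sum of its values.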
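The burst-based score of Section 2.2.2 is only partially visible in the text above, so the sketch below illustrates just the first factor, 1 - exp(-x^2 / sigma). It assumes that x counts the documents matching the entity in the current time window and that sigma is a scale parameter; both assumptions, the default value of sigma, and the function name are illustrative.

```python
import math

def burst_factor(matching_docs_in_window, sigma=50.0):
    """First factor of Score_Burst: grows towards 1 as the number of documents
    matching the entity in the current window increases (i.e. during a burst)."""
    x = matching_docs_in_window
    return 1.0 - math.exp(-(x * x) / sigma)

# Hypothetical counts: a quiet hour vs. a bursty hour for the same entity.
print(burst_factor(2))    # ~0.08 -> few matching documents, probably not vital
print(burst_factor(30))   # ~1.00 -> burst of matching documents, likely vital
```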